Skip to content

use django-post-office with a django_ses backend for emails #6533

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Mar 4, 2025

Conversation

escattone
Copy link
Contributor

@escattone escattone commented Feb 28, 2025

mozilla/sumo#1993
mozilla/sumo#2215

This PR should solve our email issues.

Important Notes

  • The post_office email backend simply queues emails in the DB. You have to tell it what "real" email backend to use for actually sending the emails. In this case, we're using the django_ses.SESBackend. When post_office creates an instance of the "real" backend in order to actually send the emails, it always creates it with fail_silently=False, because it captures any exception when trying to send an email, marks that email as "failed", and schedules it for a retry. So, it really doesn't matter what value for fail_silently you pass to the post_office.EmailBackend, it's always ignored by that backend.
  • We're using the built-in way that post_office queues a Celery task -- because we're setting its CELERY_ENABLED setting to True. This means that its built-in post_office.tasks.send_queued_mail Celery task is kicked-off each time either the post_office.EmailBackend.send_messages method or the post_office.mail.send function is called. It's important to keep in mind that the post_office.tasks.send_queued_mail Celery task is not given a specific list of queued emails to send. Instead, it runs a loop of querying the database for a batch of queued emails, sending that batch using the real backend (in this case django_ses.SESBackend), updating the status of the emails in the DB, creating DB logs, and so on until its DB query for the next batch of emails returns nothing. The important point I'm trying to make here is that each time we call send_messages in our code, we're kicking off a Celery task that will work on all queued emails, not just the ones in the send_messages call. That's why in this PR I've modified our calls to send_messages to always pass in the entire set of messages we want to send, so no longer calling send_messages for each individual message.
  • The post_office.tasks.send_queued_mail task assumes that all of the Celery workers that could handle that task share the same file system. If not, its file-based locking system fails to limit the execution of post_office.mail.send_queued to a single run at a time, which in turn means that it's highly likely that duplicate emails would be sent. In our case, each of our Celery workers is its own pod, so only the 4 worker processes within that pod share the same file system, but not all of our workers. The root cause of this issue is that the send_queued function does not work on a specific set of emails, but instead, as mentioned earlier, its work consists of querying the DB for a batch of emails, sending them, and then updating their status, and that work is not done atomically. So if more than one send_queued function is running at the same time (highly likely given our current state), it's likely that they grab the same batch of emails and send them. The solution I've taken is to create a separate Celery queue for emails -- the new email queue -- as well as another Celery K8s deployment with a single replica (pod) that pulls only from the email queue (see https://github.com/mozilla-it/webservices-infra/pull/4127). Also, the other Celery workers only pull tasks from the default celery queue, not from the email queue, so effectively, only a single pod is ever pulling tasks from the email queue.
  • I've configured post_office to log (in the database) both successful and failed emails (which is the default actually). The log entries are available via the admin interface.
  • In the near future, assuming we go with this PR, we should probably create a periodic task that deletes email log entries older than some period of time, maybe something like 1-3 months.
  • For QA/testing, post_office provides the OVERRIDE_RECIPIENTS setting, which specifies one or more email addresses that you'd like to always use instead of an email's actual recipient. This PR goes a step further -- mimicking what the django-email-bandit package does -- by indicating the original recipient within the email's text and HTML alternatives. This is really important, because it allows us to perform QA on stage using the real email system without sending unwanted emails. On stage, all emails would go to the email(s) defined in the OVERRIDE_RECIPIENTS setting. This also means that we no longer need the django-email-bandit package.

Next Steps

  • In AWS, create an IAM policy sumo-ses-v2-send-and-getaccount for SES v2 access. This has already been done.
  • In AWS, add the sumo-ses-v2-send-and-getaccount IAM policy to the ses-smtp-user.prod user.
  • In AWS, create a new access key for the ses-smtp-user.prod user.
  • In AWS, add the sumo-ses-v2-send-and-getaccount IAM policy to the ses-smtp-user.dev and ses-smtp-user.stage users.
  • In AWS, create new access keys for the ses-smtp-user.dev and ses-smtp-user.stage users.
  • (Optional) Formalize the AWS changes in TerraForm (if needed).
  • Add the AWS SES access keys to the dev, stage, and prod (and don't forget the European prod secrets) in GCP.
  • Add the POST_OFFICE_OVERRIDE_RECIPIENTS secret to the dev and stage secrets in GCP. It should be set to the same value as is currently defined for the BANDIT_EMAIL setting.
  • Review and merge https://github.com/mozilla-it/webservices-infra/pull/4127. Since it only adds K8s resources rather than deleting or modifying them, it's safe to merge at any time.
  • Merge this PR to stage for testing. Since we'd be using the same email system as production -- but of course, with all emails sent to the POST_OFFICE_OVERRIDE_RECIPIENTS recipients -- we can use stage to QA the real email system.
  • Release after QA green light.

Post-Release Steps

  • Create a PR in webservices-infra that removes the EMAIL_LOGGING_REAL_BACKEND setting from values-dev.yaml, values-stage.yaml, values-stage-europe-west1.yaml, values-prod.yaml, and values-prod-europe-west1.yaml.
  • Create a PR that removes the django-email-bandit package.
  • Remove the BANDIT_EMAIL setting within the dev and stage secrets.

@escattone escattone requested a review from akatsoulas February 28, 2025 00:48
Copy link
Collaborator

@akatsoulas akatsoulas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Configuration looks good. Don't we need though to use the send method from post office? What are the next steps?

@escattone
Copy link
Contributor Author

escattone commented Feb 28, 2025

Configuration looks good. Don't we need though to use the send method from post office?

No, because the post_office.EmailBackend.send_messages method, which is what we're relying on, does the exact same thing as the post_office.mail.send function. If we use the post_office.mail.send function, it locks us into using post_office, and we would not be able to switch between email backends. I was thinking that for now, we may not be ready to make that commitment? I'm willing to do that within this PR though, if you'd prefer that.

What are the next steps?

I added a list of next steps in the introduction.

@escattone escattone force-pushed the django-post-office-integration branch from 4ca8027 to 548f16b Compare March 3, 2025 20:06
@escattone escattone force-pushed the django-post-office-integration branch from 548f16b to d1904cc Compare March 3, 2025 20:13
@@ -1,3 +1,3 @@
#!/bin/bash

exec newrelic-admin run-program celery -A kitsune worker -Q celery,priority --max-tasks-per-child=${CELERY_WORKER_MAX_TASKS_PER_CHILD:-25}
exec newrelic-admin run-program celery -A kitsune worker -Q "${1:-celery}" --max-tasks-per-child=${CELERY_WORKER_MAX_TASKS_PER_CHILD:-25}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nicely done

@escattone escattone merged commit 64ac1c7 into mozilla:main Mar 4, 2025
2 checks passed
@escattone escattone deleted the django-post-office-integration branch March 4, 2025 18:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants